Coupled hidden Markov model for audiovisual speech recognition

ABSTRACT

A speech recognition method includes use of synchronous or asynchronous audio and video data to enhance speech recognition probabilities. A two-stream coupled hidden Markov model is trained and used to identify speech. At least one stream is derived from audio data and a second stream is derived from mouth pattern data. Gestural or other suitable data streams can optionally be combined to reduce speech recognition error rates in noisy environments.

FIELD OF THE INVENTION

[0001] The present invention relates to speech recognition systems. More specifically, this invention relates to coupled hidden Markov model techniques for evaluating audiovisual material.

BACKGROUND

[0002] Commercial speech recognition systems that primarily use audio input are widely available, but often underutilized because of reliability concerns. Such conventional audio systems are adversely impacted by environmental noise, often requiring acoustically isolated rooms and consistent microphone positioning to reach even minimally acceptable error rates in common speech recognition tasks. The success of currently available speech recognition systems is accordingly restricted to relatively controlled environments and well defined applications such as dictation or small to medium vocabulary voice-based control commands (hands-free dialing, menu navigation, GUI screen control). These limitations have prevented the widespread acceptance of speech recognition systems in acoustically uncontrolled workplaces or public sites.

[0003] In recent years, it has been shown that the use of visual information together with audio information significantly improves the performance of speech recognition in environments affected by acoustic noise. The use of visual features in conjunction with audio signals takes advantage of the bimodality of speech (audio is correlated with lip position) and the fact that visual features are invariant to acoustic noise perturbation.

[0004] Various approaches to recovering and fusing audio and visual data in audiovisual speech recognition (AVSR) systems are known. One popular approach relies on mouth shape as a key visual data input. Unfortunately, accurate detection of lip contours is often very challenging in conditions of varying illumination or during facial rotations. Alternatively, computationally intensive approaches based on gray scale lip contours modeled through principal component analysis, linear discriminant analysis, two-dimensional DCT, and maximum likelihood transform have been employed to recover suitable visual data for processing.

[0005] Fusing the recovered visual data with the audio data is similarly open to various approaches, including feature fusion, model fusion, or decision fusion. In feature fusion, the combined audiovisual feature vectors are obtained by concatenation of the audio and visual features, followed by a dimensionality reduction transform. The resultant observation sequences are then modeled using a hidden Markov model (HMM) technique. In model fusion systems, a multistream HMM that assumes state-synchronous audio and video sequences is used, although difficulties attributable to lag between visual and audio features can interfere with accurate speech recognition. Decision fusion is a computationally intensive fusion technique that independently models the audio and the visual signals using two HMMs, combining the likelihood of each observation sequence based on the reliability of each modality.

BRIEF DESCRIPTION OF THE DRAWINGS

[0006] The inventions will be understood more fully from the detailed description given below and from the accompanying drawings of embodiments of the inventions which, however, should not be taken to limit the inventions to the specific embodiments described, but are for explanation and understanding only.

[0007] FIG. 1 generically illustrates a procedure for audiovisual speech recognition;

[0008] FIG. 2 illustrates a procedure for visual feature extraction, with diagrams representing feature extraction using a masked, sized and normalized mouth region;

[0009] FIG. 3 schematically illustrates an audiovisual coupled HMM; and

[0010] FIG. 4 illustrates recognition rate using a coupled HMM model.

DETAILED DESCRIPTION

[0011] As seen with respect to the block diagram of FIG. 1, the present invention is a process 10 for audiovisual speech recognition capable of implementation on a computer-based audiovisual recording and processing system 20. The system 20 provides separate or integrated camera and audio systems for audiovisual recording 12 of both facial features and speech of one or more speakers, in real-time or as a recording for later speech processing. Audiovisual information can be recorded and stored in an analog format, or preferentially, can be converted to a suitable digital form, including but not limited to MPEG-2, MPEG-4, JPEG, Motion JPEG, or other sequentially presentable transform coded images commonly used for digital image storage. Low cost, low resolution CCD or CMOS based video camera systems can be used, although video cameras supporting higher frame rates and resolution may be useful for certain applications. Audio data can be acquired by low cost microphone systems, and can be subjected to various audio processing techniques to remove intermittent burst noise, environmental noise, static, sounds recorded outside the normal speech frequency range, or any other non-speech data signal.

[0012] In operation, the captured (stored or real-time) audiovisual data is separately subjected to audio processing and visual feature extraction 14. Two or more data streams are integrated using an audiovisual fusion model 16, and a training network and speech recognition module 18 are used to yield a desired text data stream reflecting the captured speech. As will be understood, data streams can be processed in near real-time on sufficiently powerful computing systems, processed after a delay or in batch mode, processed on multiple computer systems or parallel processing computers, or processed using any other suitable mechanism available for digital signal processing.

[0013] Software implementing suitable procedures, systems and methods can be stored in the memory of a computer system as a set of instructions to be executed. In addition, the instructions to perform the procedures described above could alternatively be stored on other forms of machine-readable media, including magnetic and optical disks. For example, the method of the present invention could be stored on machine-readable media, such as magnetic disks or optical disks, which are accessible via a disk drive (or computer-readable medium drive). Further, the instructions can be downloaded into a computing device over a data network in the form of a compiled and linked version. Alternatively, the logic could be implemented in additional computer and/or machine-readable media, such as discrete hardware components, large-scale integrated circuits (LSIs), application-specific integrated circuits (ASICs), or firmware such as electrically erasable programmable read-only memory (EEPROM).

[0014] One embodiment of a suitable visual feature extraction procedure is illustrated with respect to FIG. 2. As seen in that Figure, feature extraction 30 includes face detection 32 of the speaker's face (cartoon FIG. 42) in a video sequence. Various face detecting procedures or algorithms are suitable, including pattern matching, shape correlation, optical flow based techniques, hierarchical segmentation, or neural network based techniques. In one particular embodiment, a suitable face detection procedure uses a Gaussian mixture model to model the color distribution of the face region. The resulting color-based face template, together with a logarithmic search over the background region that deforms the template and fits it optimally to the face based on a predetermined target function, can be used to identify single or multiple faces in a visual scene.
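For illustration only, the sketch below shows how a per-pixel likelihood under a pre-trained Gaussian mixture skin-color model could be thresholded to produce a candidate face mask. The mixture parameters, the choice of RGB color space, and the threshold value are assumptions for this example; the template deformation and logarithmic search described above are not shown.

    import numpy as np

    def skin_likelihood(pixels_rgb, weights, means, covs):
        """Per-pixel likelihood under a Gaussian mixture skin-color model.

        pixels_rgb : (N, 3) float array of RGB values
        weights    : (K,) mixture weights
        means      : (K, 3) component means
        covs       : (K, 3, 3) component covariances
        """
        lik = np.zeros(pixels_rgb.shape[0])
        for w, mu, cov in zip(weights, means, covs):
            diff = pixels_rgb - mu
            inv = np.linalg.inv(cov)
            norm = 1.0 / np.sqrt(((2 * np.pi) ** 3) * np.linalg.det(cov))
            mahal = np.einsum('ni,ij,nj->n', diff, inv, diff)
            lik += w * norm * np.exp(-0.5 * mahal)
        return lik

    def face_mask(image_rgb, weights, means, covs, threshold=1e-6):
        """Pixels whose skin-color likelihood exceeds a threshold form the
        candidate face region; the template search described above would then
        refine that region (threshold value is illustrative)."""
        h, w, _ = image_rgb.shape
        lik = skin_likelihood(image_rgb.reshape(-1, 3).astype(float),
                              weights, means, covs)
        return (lik > threshold).reshape(h, w)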

[0015] After the face is detected, mouth region discrimination 34 is useful, since other areas of the face generally have low or minimal correlation with speech. The lower half of the detected face is a natural choice for the initial estimate of the mouth region (cartoon FIG. 44). Next, linear discriminant analysis (LDA) is used to assign the pixels in the mouth region to the lip and face classes (cartoon FIG. 46). LDA transforms the pixel values from the RGB space into a one-dimensional space that best discriminates between the two classes. The optimal linear discriminant space is computed using a set of manually segmented images of the lip and face regions.
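As a sketch of this two-class LDA step, assuming training pixels gathered from the manually segmented lip and face regions, the Fisher discriminant direction can be computed and used to project each RGB pixel onto a one-dimensional axis. The function names and the thresholding step are illustrative, not part of the original description.

    import numpy as np

    def fit_lip_face_lda(lip_pixels, face_pixels):
        """Fisher discriminant direction separating lip and face RGB pixels.

        lip_pixels, face_pixels : (N, 3) arrays gathered from manually
        segmented training images.
        """
        mu_lip, mu_face = lip_pixels.mean(axis=0), face_pixels.mean(axis=0)
        # Pooled within-class scatter matrix (unnormalized)
        sw = np.cov(lip_pixels.T, bias=True) * len(lip_pixels) \
           + np.cov(face_pixels.T, bias=True) * len(face_pixels)
        w = np.linalg.solve(sw, mu_lip - mu_face)
        return w / np.linalg.norm(w)

    def project_pixels(image_rgb, w):
        """Map each RGB pixel onto the 1-D discriminant axis; thresholding the
        result labels pixels as lip or face."""
        return image_rgb.reshape(-1, 3).astype(float) @ w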

[0016] The contour of the lips is obtained through a binary chain encoding method followed by a smoothing operation. The refined position of the mouth corners is obtained by applying the corner finding filter:

$$w[m,n] = \exp\left(-\frac{m^{2}+n^{2}}{2\sigma^{2}}\right), \quad \sigma^{2}=70, \quad -3 < m, n \leq 3,$$

[0017] in a window around the left and right extremities of the lip contour. The result of the lip contour and mouth corners detection is illustrated in figure cartoon 48 by the dotted line around the lips and mouth.
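A minimal sketch of this corner refinement follows, assuming a gray-scale mouth image and an approximate corner position taken from the chain-coded contour. The filter itself follows the formula above; the search window size and the maximum-response selection criterion are assumptions, since the text specifies only the filter.

    import numpy as np
    from scipy.signal import correlate2d

    def corner_filter(sigma2=70.0):
        """Gaussian corner-finding filter w[m, n] = exp(-(m^2 + n^2)/(2*sigma^2))
        over the support -3 < m, n <= 3 given in the text."""
        coords = np.arange(-2, 4)          # integers satisfying -3 < m <= 3
        m, n = np.meshgrid(coords, coords, indexing='ij')
        return np.exp(-(m ** 2 + n ** 2) / (2.0 * sigma2))

    def refine_corner(gray_mouth, approx_corner, half_window=8):
        """Search a small window around an approximate lip extremity and return
        the location of the strongest filter response (illustrative criterion)."""
        r, c = approx_corner
        patch = gray_mouth[r - half_window:r + half_window + 1,
                           c - half_window:c + half_window + 1].astype(float)
        resp = correlate2d(patch, corner_filter(), mode='same')
        dr, dc = np.unravel_index(np.argmax(resp), resp.shape)
        return r - half_window + dr, c - half_window + dc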

[0018] The lip contour and position of the mouth corners are used to estimate the size and the rotation of the mouth in the image plane. Using the above estimates of the scale and rotation parameters of the mouth, masking, resizing, rotation and normalization 36 is undertaken, with a rotation and size normalized gray scale region of the mouth (typically 64×64 pixels) being obtained from each frame of the video sequence. A masking variable shape window is also applied, since not all the pixels in the mouth region have the same relevance for visual speech recognition; the most significant information for speech recognition is contained in the pixels inside the lip contour. The masking variable shape window used to multiply the pixel values in the gray scale normalized mouth region is described as:

$$w[i,j] = \begin{cases} 1, & \text{if } i, j \text{ are inside the lip contour} \\ 0, & \text{otherwise} \end{cases} \qquad \text{(Eq. 1)}$$

[0019] Cartoon FIG. 50 in FIG. 2 illustrates the result of the rotation and size normalization and masking steps.
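The masking window of Eq. 1 amounts to an element-wise multiplication of the normalized mouth region by a binary lip-contour mask. A minimal sketch, assuming 64×64 arrays as described above, is:

    import numpy as np

    def apply_lip_mask(mouth_gray, lip_contour_mask):
        """Eq. 1: multiply the 64x64 normalized mouth region by a window that
        is 1 inside the lip contour and 0 elsewhere.

        mouth_gray       : (64, 64) gray-scale normalized mouth region
        lip_contour_mask : (64, 64) boolean array, True inside the lip contour
        """
        return mouth_gray.astype(float) * lip_contour_mask.astype(float)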

[0020] Next, multiclass linear discriminant analysis 38 is performed on the data. First, the normalized and masked mouth region is decomposed into eight blocks of height 32 pixels and width 16 pixels, and a two-dimensional discrete cosine transform (2D-DCT) is applied to each of these blocks. A set of four 2D-DCT coefficients from a window of size 2×2 in the lowest frequencies of the 2D-DCT domain is extracted from each block. The resulting coefficients are arranged in a vector of size 32. In the final stage of the video feature extraction cascade, the multiclass LDA is applied to the vectors of 2D-DCT coefficients. Typically, the classes of the LDA are associated with the words available in the speech database. A set of 15 coefficients, corresponding to the most significant generalized eigenvalues of the LDA decomposition, is used as the visual observation vector.
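The block 2D-DCT stage can be sketched as follows, assuming the 64×64 masked mouth region described above. The eight 32×16 blocks, the 2×2 low-frequency window, and the final projection onto 15 LDA directions follow the text; the orthonormal DCT normalization and the name of the precomputed projection matrix are assumptions.

    import numpy as np
    from scipy.fftpack import dct

    def dct2(block):
        """Orthonormal 2-D DCT of one block."""
        return dct(dct(block, axis=0, norm='ortho'), axis=1, norm='ortho')

    def mouth_dct_features(mouth_masked):
        """Decompose the 64x64 masked mouth region into eight 32x16 blocks,
        apply a 2D-DCT to each, and keep the 2x2 lowest-frequency coefficients,
        giving a 32-dimensional vector."""
        feats = []
        for r in range(0, 64, 32):          # 2 block rows of height 32
            for c in range(0, 64, 16):      # 4 block columns of width 16
                block = mouth_masked[r:r + 32, c:c + 16]
                feats.append(dct2(block)[:2, :2].ravel())
        return np.concatenate(feats)        # shape (32,)

    def visual_observation(mouth_masked, lda_projection):
        """Project the 32 DCT coefficients onto the 15 most significant
        generalized eigenvectors of the multiclass LDA; lda_projection is an
        assumed (32, 15) matrix learned from the word-labeled training data."""
        return mouth_dct_features(mouth_masked) @ lda_projection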

[0021] The following table compares the video-only recognition rates for several visual feature techniques and illustrates the improvement obtained by using the masking window and by using block 2D-DCT coefficients instead of 1D-DCT coefficients:

Video Features                  Recognition Rate
1D DCT + LDA                    41.66%
Mask, 1D DCT + LDA              45.17%
2D DCT blocks + LDA             45.63%
Mask, 2D DCT blocks + LDA       54.08%

[0022] In all the experiments the video observation vectors were modeled using a 5-state, 3-mixture left-to-right HMM with diagonal covariance matrices.

[0023] After face detection, processing, and upsampling of the video data to the audio data rate (if necessary), the generated video data must be fused with the audio data using a suitable fusion model. In one embodiment, a coupled hidden Markov model (HMM) is useful. The coupled HMM is a generalization of the HMM suitable for a wide range of multimedia applications that integrate two or more streams of data. A coupled HMM can be seen as a collection of HMMs, one for each data stream, where the discrete nodes at time t for each HMM are conditioned by the discrete nodes at time t−1 of all the related HMMs. Diagram 60 in FIG. 3 illustrates the continuous mixture two-stream coupled HMM used in the audiovisual speech recognition system. The squares represent the hidden discrete nodes while the circles describe the continuous observable nodes. The hidden nodes can be conditioned temporally as coupled nodes and to the remaining hidden nodes as mixture nodes. Mathematically, the elements of the coupled HMM are described as:

$$\pi_{0}^{c}(i) = P\left(q_{0}^{c} = i\right) \qquad \text{(Eq. 2)}$$

$$b_{t}^{c}(i) = P\left(O_{t}^{c} \mid q_{t}^{c} = i\right) \qquad \text{(Eq. 3)}$$

$$a_{i|j,k}^{c} = P\left(q_{t}^{c} = i \mid q_{t-1}^{0} = j,\ q_{t-1}^{1} = k\right) \qquad \text{(Eq. 4)}$$

[0024] where $q_{t}^{c}$ is the state of the coupled node in the cth stream at time t.

[0025] In a continuous mixture with Gaussian components, the probabilities of the coupled nodes are given by:

$$b_{t}^{c}(i) = \sum_{m=1}^{M_{i}^{c}} w_{i,m}^{c}\, N\left(O_{t}^{c};\ \mu_{i,m}^{c},\ U_{i,m}^{c}\right) \qquad \text{(Eq. 5)}$$

[0026] where $\mu_{i,m}^{c}$ and $U_{i,m}^{c}$ are the mean and covariance matrix of the ith state of a coupled node and the mth component of the associated mixture node in the cth channel.

[0027] $M_{i}^{c}$ is the number of mixtures corresponding to the ith state of a coupled node in the cth stream, and the weight $w_{i,m}^{c}$ represents the conditional probability $P\left(p_{t}^{c} = m \mid q_{t}^{c} = i\right)$, where $p_{t}^{c}$ is the component of the mixture node in the cth stream at time t.
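A minimal sketch of Eq. 5 for one stream and one state follows. Diagonal covariance matrices are assumed, consistent with paragraph [0022], and the parameter containers (per-state lists of weights, means, and covariances) are illustrative.

    import numpy as np

    def gaussian_pdf(o, mu, cov):
        """Multivariate normal density N(o; mu, cov), using only the diagonal
        of the covariance matrix (cf. paragraph 0022)."""
        var = np.diag(cov)
        norm = 1.0 / np.sqrt(np.prod(2 * np.pi * var))
        return norm * np.exp(-0.5 * np.sum((o - mu) ** 2 / var))

    def coupled_node_likelihood(o_t, state_i, weights, means, covs):
        """Eq. 5: b_t^c(i) = sum_m w_{i,m}^c N(O_t^c; mu_{i,m}^c, U_{i,m}^c)
        for one stream c, one state i, and its mixture components."""
        return sum(w * gaussian_pdf(o_t, mu, cov)
                   for w, mu, cov in zip(weights[state_i],
                                         means[state_i],
                                         covs[state_i]))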

[0033] The constructed HMM must be trained to identify words. Maximum likelihood (ML) training of dynamic Bayesian networks in general, and of coupled HMMs in particular, is well understood. Any discrete time and space dynamical system governed by a hidden Markov chain emits a sequence of observable outputs with one output (observation) for each state in a trajectory of such states. From the observable sequence of outputs, the most likely dynamical system can be calculated. The result is a model for the underlying process. Alternatively, given a sequence of outputs, the most likely sequence of states can be determined. In speech recognition tasks, a database of words, along with a separate training set for each word, can be generated.

[0034] Unfortunately, the iterative maximum likelihood estimation of the parameters only converges to a local optimum, making the choice of the initial parameters of the model a critical issue. An efficient method for the initialization of the ML estimate must be used for good results. One such method is based on the Viterbi algorithm, which determines the optimal sequence of states for the coupled nodes of the audio and video streams that maximizes the observation likelihood. The following steps describe the Viterbi algorithm for the two-stream coupled HMM used in one embodiment of the audiovisual fusion model. As will be understood, extension of this method to a coupled HMM with more than two streams is straightforward.

[0035] Initialization:

$$\delta_{0}(i,j) = \pi_{0}^{a}(i)\,\pi_{0}^{v}(j)\,b_{0}^{a}(i)\,b_{0}^{v}(j) \qquad \text{(Eq. 6)}$$

$$\psi_{0}(i,j) = 0 \qquad \text{(Eq. 7)}$$

Recursion:

$$\delta_{t}(i,j) = \max_{k,l}\left\{\delta_{t-1}(k,l)\,a_{i|k,l}^{a}\,a_{j|k,l}^{v}\right\} b_{t}^{a}(i)\,b_{t}^{v}(j) \qquad \text{(Eq. 8)}$$

$$\psi_{t}(i,j) = \arg\max_{k,l}\left\{\delta_{t-1}(k,l)\,a_{i|k,l}^{a}\,a_{j|k,l}^{v}\right\} \qquad \text{(Eq. 9)}$$

[0036] Termination:

$$P = \max_{i,j}\left\{\delta_{T}(i,j)\right\} \qquad \text{(Eq. 10)}$$

$$\left\{q_{T}^{a}, q_{T}^{v}\right\} = \arg\max_{i,j}\left\{\delta_{T}(i,j)\right\} \qquad \text{(Eq. 11)}$$

[0037] Backtracking (reconstruction):

$$\left\{q_{t}^{a}, q_{t}^{v}\right\} = \psi_{t+1}\left(q_{t+1}^{a}, q_{t+1}^{v}\right) \qquad \text{(Eq. 12)}$$
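A compact sketch of this two-stream Viterbi decoder (Eqs. 6-12) is given below. The array shapes, argument names, and the use of raw probabilities rather than log-probabilities are assumptions made for readability; a practical implementation would normally work in the log domain to avoid underflow.

    import numpy as np

    def coupled_viterbi(b_a, b_v, pi_a, pi_v, a_a, a_v):
        """Viterbi decoding for a two-stream coupled HMM (Eqs. 6-12).

        b_a, b_v   : (T, Na), (T, Nv) observation probabilities per stream/state
        pi_a, pi_v : (Na,), (Nv,) initial state probabilities
        a_a        : (Na, Na, Nv), a_a[i, k, l] = P(q_t^a = i | q_{t-1}^a = k, q_{t-1}^v = l)
        a_v        : (Nv, Na, Nv), a_v[j, k, l] = P(q_t^v = j | q_{t-1}^a = k, q_{t-1}^v = l)
        Returns the best-path score P and the audio/video state paths.
        """
        T, na = b_a.shape
        nv = b_v.shape[1]
        delta = np.zeros((T, na, nv))
        psi = np.zeros((T, na, nv, 2), dtype=int)

        delta[0] = np.outer(pi_a * b_a[0], pi_v * b_v[0])            # Eq. 6
        for t in range(1, T):
            for i in range(na):
                for j in range(nv):
                    scores = delta[t - 1] * a_a[i] * a_v[j]          # over (k, l)
                    k, l = np.unravel_index(np.argmax(scores), scores.shape)
                    delta[t, i, j] = scores[k, l] * b_a[t, i] * b_v[t, j]  # Eq. 8
                    psi[t, i, j] = (k, l)                            # Eq. 9

        p = delta[T - 1].max()                                       # Eq. 10
        qa = np.zeros(T, dtype=int)
        qv = np.zeros(T, dtype=int)
        qa[-1], qv[-1] = np.unravel_index(np.argmax(delta[T - 1]), (na, nv))  # Eq. 11
        for t in range(T - 2, -1, -1):                               # Eq. 12
            qa[t], qv[t] = psi[t + 1, qa[t + 1], qv[t + 1]]
        return p, qa, qv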

[0038] The segmental K-means algorithm for the coupled HMM proceeds as follows:

[0039] Step 1—For each training observation sequence r, the data in each stream is uniformly segmented according to the number of states of the coupled nodes, and an initial state sequence for the coupled nodes

$$Q = q_{r,0}^{a,v}, \ldots, q_{r,t}^{a,v}, \ldots, q_{r,T-1}^{a,v}$$

is obtained.

[0040] For each state i of the coupled nodes in stream c, the mixture segmentation of the data assigned to it is obtained using the K-means algorithm with $M_{i}^{c}$ clusters.

[0041] Consequently, the sequence of mixture components

$$P = p_{r,0}^{a,v}, \ldots, p_{r,t}^{a,v}, \ldots, p_{r,T-1}^{a,v}$$

for the mixture nodes is obtained.

[0044] Step 2—The new parameters are estimated from the segmented data:

$$\mu_{i,m}^{a,v} = \frac{\sum_{r,t} \gamma_{r,t}^{a,v}(i,m)\, O_{t}^{a,v}}{\sum_{r,t} \gamma_{r,t}^{a,v}(i,m)} \qquad \text{(Eq. 13)}$$

$$\sigma_{i,m}^{2\;a,v} = \frac{\sum_{r,t} \gamma_{r,t}^{a,v}(i,m)\left(O_{t}^{a,v} - \mu_{i,m}^{a,v}\right)\left(O_{t}^{a,v} - \mu_{i,m}^{a,v}\right)^{T}}{\sum_{r,t} \gamma_{r,t}^{a,v}(i,m)} \qquad \text{(Eq. 14)}$$

$$w_{i,m}^{a,v} = \frac{\sum_{r,t} \gamma_{r,t}^{a,v}(i,m)}{\sum_{r,t} \sum_{m} \gamma_{r,t}^{a,v}(i,m)} \qquad \text{(Eq. 15)}$$

$$a_{i|k,l}^{a,v} = \frac{\sum_{r,t} \varepsilon_{r,t}^{a,v}(i,k,l)}{\sum_{r,t} \sum_{k} \sum_{l} \varepsilon_{r,t}^{a,v}(i,k,l)} \qquad \text{(Eq. 16)}$$

[0045] and where

$$\gamma_{r,t}^{a,v}(i,m) = \begin{cases} 1, & \text{if } q_{r,t}^{a,v} = i \text{ and } p_{r,t}^{a,v} = m \\ 0, & \text{otherwise} \end{cases} \qquad \text{(Eq. 17)}$$

$$\varepsilon_{r,t}^{a,v}(i,k,l) = \begin{cases} 1, & \text{if } q_{r,t}^{a,v} = i,\ q_{r,t-1}^{a} = k,\ q_{r,t-1}^{v} = l \\ 0, & \text{otherwise} \end{cases} \qquad \text{(Eq. 18)}$$
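As an illustrative sketch of the mixture re-estimation in Eqs. 13-15 for a single stream, the hard assignments γ of frames to states and mixture components can be applied directly to the concatenated training data. Diagonal covariances are used here (per paragraph [0022]); the concatenation of training sequences and the omission of the transition update of Eq. 16 are simplifications for brevity.

    import numpy as np

    def reestimate_mixtures(obs, states, mixes, n_states, n_mix):
        """Segmental K-means re-estimation of mixture parameters for one stream
        (Eqs. 13-15), using hard assignments of frames to states and mixtures.

        obs    : (T, D) observation vectors, all training sequences concatenated
        states : (T,) Viterbi state of the stream's coupled node per frame
        mixes  : (T,) mixture component assigned to each frame
        """
        d = obs.shape[1]
        mu = np.zeros((n_states, n_mix, d))
        var = np.ones((n_states, n_mix, d))    # diagonal covariances
        w = np.zeros((n_states, n_mix))
        for i in range(n_states):
            in_state = states == i
            for m in range(n_mix):
                sel = in_state & (mixes == m)   # frames with gamma_{r,t}(i, m) = 1
                if sel.any():
                    mu[i, m] = obs[sel].mean(axis=0)     # Eq. 13
                    var[i, m] = obs[sel].var(axis=0)     # Eq. 14, diagonal case
                    w[i, m] = sel.sum()
            if in_state.any():
                w[i] /= in_state.sum()                   # Eq. 15
        return mu, var, w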

[0046] Step 3—At consecutive iterations, an optimal sequence Q of the coupled nodes is obtained using the Viterbi algorithm (Equations 6 through 12). The sequence of mixture components P is obtained by selecting, at each moment t, the mixture

$$p_{r,t}^{a,v} = \arg\max_{m = 1, \ldots, M_{i}^{a,v}} P\left(O_{t}^{a,v} \mid q_{r,t}^{a,v} = i,\ m\right) \qquad \text{(Eq. 19)}$$

[0047] Step 4—Steps 2 and 3 are repeated until the difference between the observation probabilities of the training sequences at consecutive iterations falls below a convergence threshold.

[0048] Word recognition is carried out via computation of the Viterbi algorithm (Equations 6-12) for the parameters of all the word models in the database. The parameters of the coupled HMM corresponding to each word in the database are obtained in the training stage using clean audio signals (SNR = 30 dB). In the recognition stage, the input of the audio and visual streams is weighted based on the relative reliability of the audio and visual features for different levels of acoustic noise. Formally, the state probability at time t for an observation vector $O_{t}^{a,v}$ becomes

$$\tilde{b}_{t}^{a,v}(i) = b_{t}\left(O_{t}^{a,v} \mid q_{t}^{a,v} = i\right)^{\alpha_{a,v}} \qquad \text{(Eq. 20)}$$

[0049] where $\alpha_{a} + \alpha_{v} = 1$ and $\alpha_{a}, \alpha_{v} \geq 0$ are the exponents of the audio and video streams. The values of $\alpha_{a}$ and $\alpha_{v}$ corresponding to a specific signal-to-noise ratio (SNR) are obtained experimentally to maximize the average recognition rate. In one embodiment of the system, the audio exponents were optimally found to be:

SNR (dB)   30    26    20    16
α_a        0.9   0.8   0.5   0.4
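A minimal sketch of the stream weighting in Eq. 20 follows: each stream's observation probabilities are simply raised to the corresponding exponent before Viterbi decoding (the coupled_viterbi sketch above), with alpha_a taken from the table and alpha_v = 1 − alpha_a.

    def weight_streams(b_a, b_v, alpha_a):
        """Eq. 20: raise each stream's observation probabilities to its
        reliability exponent before running the Viterbi decoder; alpha_a is
        chosen per SNR from the table above, and alpha_v = 1 - alpha_a."""
        alpha_v = 1.0 - alpha_a
        return b_a ** alpha_a, b_v ** alpha_v

    # For example, at 20 dB SNR the table above gives alpha_a = 0.5:
    # b_a_w, b_v_w = weight_streams(b_a, b_v, alpha_a=0.5)
    # p, qa, qv = coupled_viterbi(b_a_w, b_v_w, pi_a, pi_v, a_a, a_v)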

[0054] Experimental results for a speaker-dependent audiovisual word recognition system on a database of 36 words have been determined. Each word in the database is repeated ten times by each of the ten speakers in the database. For each speaker, nine examples of each word were used for training and the remaining example was used for testing. The average audio-only, video-only and audiovisual recognition rates are presented graphically in chart 70 of FIG. 4 and in the table below. In chart 70, the triangle data points represent a visual HMM, the diamond data points represent an audio HMM, the star data points represent an audiovisual HMM, and the square data points represent an audiovisual coupled HMM.

SNR (dB)   30       26       20       16
V HMM      53.70%   53.70%   53.70%   53.70%
A HMM      97.46%   80.58%   50.19%   28.26%
AV HMM     98.14%   89.34%   72.21%   63.88%
AV CHMM    98.14%   90.72%   75.00%   69.90%

[0055] For audio-only speech recognition, the acoustic observation vectors (13 MFCC coefficients extracted from a window of 20 ms) are modeled using an HMM with the same characteristics as the one described for video-only recognition. For the audio-video recognition, a coupled HMM with states for the coupled nodes in both audio and video streams, no back transitions, and three mixtures per state, is used. As can be seen from inspection of chart 70 and the above table, the coupled HMM-based audiovisual speech recognition rate increases by 45% over audio-only speech recognition at an SNR of 16 dB. Compared to the multistream HMM, the coupled HMM-based audiovisual recognition system shows consistently better results as the SNR decreases, reaching a nearly 7% reduction in word error rate at 16 dB.

[0056] As will be appreciated, accurate audiovisual data to text processing can be used to enable various applications, including provision of a robust framework for systems involving human computer interaction and robotics. Accurate speech recognition in high noise environments allows continuous speech recognition under uncontrolled environments, speech command and control devices such as hands-free telephones, and other mobile devices. In addition, the coupled HMM can be applied to a large number of multimedia applications that involve two or more related data streams, such as speech, one- or two-hand gestures, and facial expressions. In contrast to a conventional HMM, the coupled HMM can be readily configured to take advantage of parallel computing, with separate modeling/training data streams under control of separate processors.

[0057] As will be understood, reference in this specification to “an embodiment,” “one embodiment,” “some embodiments,” or “other embodiments” means that a particular feature, structure, or characteristic described in connection with the embodiments is included in at least some embodiments, but not necessarily all embodiments, of the invention. The various appearances of “an embodiment,” “one embodiment,” or “some embodiments” are not necessarily all referring to the same embodiments.

[0058] If the specification states a component, feature, structure, or characteristic “may”, “might”, or “could” be included, that particular component, feature, structure, or characteristic is not required to be included. If the specification or claim refers to “a” or “an” element, that does not mean there is only one of the element. If the specification or claims refer to “an additional” element, that does not preclude there being more than one of the additional element.

[0059] Those skilled in the art having the benefit of this disclosure will appreciate that many other variations from the foregoing description and drawings may be made within the scope of the present invention. Accordingly, it is the following claims, including any amendments thereto, that define the scope of the invention.

What is claimed is:
1. A speech recognition method comprising using an audio and a video data set that respectively provide a first data stream of speech data and a second data stream of face image data, and applying a two stream coupled hidden Markov model to the first and second data streams for speech recognition.
2. The method of claim 1, wherein the audio and video data sets providing the first and second data streams are asynchronous.
3. The method of claim 1, further comprising parallel processing of the first and second data streams.
4. The method of claim 1, further comprising visual feature extraction of a mouth region from the video data set.
5. The method of claim 1, further comprising visual feature extraction from the video data set using a variable shape window and application of a two dimensional discrete transform.
6. The method of claim 1, further comprising visual feature extraction from the video data set using linear discriminant analysis.
7. The method of claim 1, further comprising training of the two stream coupled hidden Markov model using a Viterbi algorithm.
8. An article comprising a computer readable medium to store computer executable instructions, the instructions defined to cause a computer to use an audio and a video data set that respectively provide a first data stream of speech data and a second data stream of face image data, and apply a two stream coupled hidden Markov model to the first and second data streams for speech recognition.
9. The article comprising a computer readable medium to store computer executable instructions of claim 8, wherein the instructions further cause a computer to process asynchronous first and second data streams.
10. The article comprising a computer readable medium to store computer executable instructions of claim 8, wherein the instructions further cause a computer to parallel process the first and second data streams.
11. The article comprising a computer readable medium to store computer executable instructions of claim 8, wherein the instructions further cause a computer to provide visual feature extraction of a mouth region from the video data set.
12. The article comprising a computer readable medium to store computer executable instructions of claim 8, wherein the instructions further cause a computer to provide visual feature extraction from the video data set using a variable shape window and application of a two dimensional discrete transform.
13. The article comprising a computer readable medium to store computer executable instructions of claim 8, wherein the instructions further cause a computer to provide visual feature extraction from the video data set using linear discriminant analysis.
14. The article comprising a computer readable medium to store computer executable instructions of claim 8, wherein the instructions further cause a computer to train the two stream coupled hidden Markov model using a Viterbi algorithm.
15. A speech recognition system comprising an audiovisual capture module to capture an audio and a video data set that respectively provide a first data stream of speech data and a second data stream of face image data, and a speech recognition module that applies a two stream coupled hidden Markov model to the first and second data streams for speech recognition.
16. The speech recognition system of claim 15, further comprising asynchronous audio and video data sets.
17. The speech recognition system of claim 15, further comprising parallel processing of the first and second data streams by the speech recognition module.
18. The speech recognition system of claim 15, further comprising visual feature extraction of a mouth region from the video data set by the audiovisual capture module.
19. The speech recognition system of claim 15, further comprising visual feature extraction from the video data set using a variable shape window and application of a two dimensional discrete transform by the audiovisual capture module.
20. The speech recognition system of claim 15, further comprising visual feature extraction from the video data set by the audiovisual capture module using linear discriminant analysis.
21. The speech recognition system of claim 15, further comprising training of the two stream coupled hidden Markov model by the speech recognition module using a Viterbi algorithm.